nw 


Contract  NJB0014-75-C-046W 

Office  oT~?Taval  Research  and  Brown  University 


Abduction  Macmnes 


This  paper  deals  with  a model  of  language  syntax  acquisition 


It  is  assumed  that  the  artificial  1 


ion 


which  we  hope  to  discover  on  the  basis  of  a finite  sample  of  sentences 


Specifically  excluded  is  the  simple  formulation  of  the  observations 


actually  made,  although  a naive  description  is  highly  desirable.  In 


the  terminology  of  learning  models,  an  insightful  model  is  to  b 


over  the  rote  learning  exemplified  in  a list 


Pattern  Conception"  by  Miller  and  Chomsky  [1957] 


tate  automata  models.  An  early  aesc 


on  the  virtues  of  finite 


n [1957] 


given  by  Solomonof 


a machine 


The  various  guises  and  disguises  of  this  problem  arc  found  in  artificial 
intelligence,  human  cognitive  studies,  pattern  recognition,  linguistics 
and  in  systems  theory  under  labels  such  as  inductive  inference,  automaton 
identification  and  grammatical  inference.  The  fine  exposition  by  Fu  [1974] 
in  a chapter  entitled  "Grammatical  Inference  for  Syntactic  Pattern  Recognition" 
surveys  many  approaches  and  also  contains  49  references.  Additional 
references  can  be  found  in  Trakhtcnbrot  anc  Eared in'  [1973]. 

As  far  as  modelling  of  acquisition  is  concerned , little  attention 

has  been  focussed  on  possible  natural  solutions.  The  following  investi- 

r 

gation  is  motivated  toward  the  presentation  of  model  models.  The  investi- 
gation follows  the  paths  initiated  by  Grenander  [1974]  in  his  abductory 
induction  models.  If  wc  restrict  consideration  to  regular  grammars, 
the  number  of  possible  grammars  is  overwhelmingly  large  for  even  small  size 
alphabets  and  modest  numbers  of  variables.  This  implies  chat  models  for 


analogy  to  real  world  phenomena  which  exhibit  language  acquisition 
ability  cannot  be  based  on  enumerative  inference  or  finite  search  techniques* 
for  this  reason,  we  also  seek  an  alternative  to  direct  implementations 
of  maximum  likelihood  solutions  which  select  one  grammar  over  another  on 
probabilistic  criteria  determined  by  the  solution  of  large  scale  linear 
systems . 


The  model  consists  of  two  components;  the  first  conyjonent  is  a teacher 
in  a dual  role.  The  teacher  acts  as  a generative  grammar  which  produces 
strings  in  a syntax-controlled  probability  language  (Grenander  [1967]). 

The  random  structure  of  the  language  is  induced  by  imposing  probabilities 
on  the  rules  of  the  grammar.  In  this  way,  the  production  of  some  strings 
can  be  inhibited  while  the  production  of  others  can  be  made  more  likely. 

This  adds  to  the  teacher's  grammar  a facet  of  linguistic  performance 
although  it  does  not  increase  the  generative  power  of  the  grammar. 

The  teacher  also  serves  as  an  acceptor  of  strings;  it  can  judge  whether 
or  not  a sentence  presented  to  it  by  the  learner  is  within  its  competence. 

The  second  component,  called  the  learner,  consists  of  two  principal 
procedures  which  carry  out  the  construction  of  a copy  of  the  teacher. 

The  first  procedure  or  phase  is  devoted  to  the  classification  of  words 
into  equivalence  classes  on  the  basis  of  grammatical  substitutability; 
this  procedure  is  in  priaiple  infinitary.  These  classes  are  familiar  in 
structural  linguistics  where  they  are  called  families  (Kulagina  [1958])  or 
categories  (Miller  and  Chomsky  [1963])  although  we  do  not  use  these 
classes  in  quite  the  same  way.  They  are  introduced  here  to  reduce  the 
combinatorial  complexity  of  the  constructions. 


The  first  phase  begins  with  the  assumption  that  all  words  are 


in  one  equivalence  class;  this  is  the  tabula  rasa  with  which  the  learner 
begins.  The  discovery  of  the  classes  is  carried  out  by  the  resolution 
of  the  dictionary  into  the  classes  which  form  the  required  partition. 

The  teacher  randomly  generates  strings  which  are  presented  to  the  learner 
as  a training  sequence.  For  each  string  in  the  training  sequence,  a 
word  is  selected  according  to  a weighted  probabilistic  strategy;  this 
might  be  thought  of  as  an  attention  function.  This  string  is  used  to 
either  strengthen  the  learner's  belief  about  the  selected  word's 
membership  in  its  present  grammatical  class  or  it  is  used  to  introduce 
a new  hypothesis  to  be  entertained  about  the  relation  of  this  word  to 
the  sought  for  partition  of  the  dictionary.  The  term  abduction,  introduced 
by  C.  S.  S.  Peirce  to  describe  the  starting  of  a hypothesis,  is  applied 
to  describe  this  process  which  either  changes  the  class  membership  of 
the  selected  word  or  forms  a new  class  with  this  selected  word.  All 
classes  formed  are  characterized  by  a fixed  representative  word  called  a 
prototype . 

The  second  phase  carries  out  the  discovery  of  the  syntactic  variables 
and  the  rewrite  rules  which  govern  these  variables.  Initially  in  this 
phase,  the  learner  has  a tabula  rasa  with  respect  to  variables;  the 
discovery  of  variables  proceeds  in  a analogous  manner  to  the  phase  one 
process.  Each  string  in  the  training  sequence  is  analyzed  to  seme  preset 
depth  to  determine  initial  string  equivalence  classes,  i.  e.  the  syntactic 
variables.  Each  initial  siring  can  be  decoded  with  respect  to  the  word 
class  partition  determined  in  phase  one  as  described  above.  This  indexing 
scheme  is  utilized  to  implent  efficiently  the  process  which  determines  the 


4 


partition  of  initial  strings  into  the  sought  for  syntactic  variables. 

A subsequent  encoding  of  initial  strings  is  a representation  of  the  rewrite 
rules.  A simple  tally  scheme  computes  the  experimental  frequency 
defined  probabilities. 

The  model  described  above  is  a blueprint  for  a language  discovery 
machine.  Such  a machine  has  been  implemented  in  /1F/A360;  this  machine 
has  been  tested  on  a fragment  English  grammar  and  on  several  formal 
grammars.  The  fragment  English  grammar  consists  of  87  rules  on  52  words 
in  23  classes;  the  rules  govern  the  18  syntactic  variables  in  this 
syntax-controlled  probability  grammar.  The  expected  sentence  length  is 
7.05  words.  In  a typical  experiment,  115  sentences  were  "listened  to" 
by  the  learner  to  determine  20  of  the  23  classes  and  to  correctly 
classify  48  of  the  words  in  the  dictionary.  After  27  sentences  were 
analyzed  to  a depth  of  8 words,  17  of  the  18  variables  were  discovered; 
when  the  depth  was  increased  to  15,  20  additional  sentences 
were  generated  before  the  remaining  variable  was  discovered.  The  graphs 
in  figures  (1)  and  (2)  illustrate  the  learning  characteristics  exemplified 

by  the  word  class  discovery  procedure. 


NUMBER  OF  DISCOVERED  WORD  CLASSES  v.  NUMBER  OF  SENTENCES 
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4.  The  Euclidean  norm  of  the  residual  vector  can  of  course  be 
reduced  by  the  direct  application  of  the  least  squares  procedure 
to  the  form  A -i-  Bexp(Cr)  with  the  observed  data.  The 
constrained  problem:  find  A,  B and  C to  minimize 

M =Y (z  - A -Bexp(Cr))2  subject  to  the  tabula  rasa  constraint 
•z  = A + Bcxp ( C ) can  be  readily  computed  as  follows: 
the  transcendental  equation  in  C 


(z_-l_)*  (v*  - [ ( v* v ' )/( v*  v)  Jv)  - 0 


v.-here  v =(exp(C)  -exp(Cr) ) and  v' 


dv 

dc 


is  solved  for  C 


and  this  value  is 
and  A=1  - Bexp(C) 
also  recorded  in 


used  to  determine  B = - v • (z-l)/(v- v) 

This  has  been  done  and  the  results  are 
table  1 and  depicted  graphically  in  figure  3. 


Figure  3( original  data  shown  in  red) 
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5 . A conjecture  based  on  visual  observation  is  that  the 
learning  characteristics  fall  between  the  values  determined 
in  section  3 and  section  the  sought  for  characteristics 
would  better  represent  both  the  early  (rapid)  learning  and 
the  asymptotic  qualities  vis.,  the  time  to  determine  ail  the 
words  correctly.  A heuristic  procedure  based  on  the  observed 
data  could  be  implemented;  this  v/ould  especially  be  of  use 
if  predictive  qualities  of  the  model  were  desired.  In  the 
least  squares  procedure,  a weighting  function  appropriately 
chosen  would  improve  the  fit.  We  mention  here  that  the  true 
asymptotic  value  (e.g.the  number  of  words  to  be  classified), 
which  is  known  a priori,  denoted  here  by  z_  could  be  introduced 
as  an  additional  constraint  to  reformulate  the  least  squares 
problem  as:  find  C to  minimise  2af(  z^-  za[l  - (l-l/za)exp(C(r-l) ) ])‘ 
This  leads  to  a transcendental  equation  in  C 


[(z  - lza)/(l  - za)  - cxp(Cu) J*u  = 0 


where  the  t-vector  u = (r-l) . 
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Davis,  P.  J.  [1963],  Interpolation  and  Approximation,  Blaisdell, 
Waltham . 
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1.  Consider  an  "incidence  matrix"  of  word  equivalents  whose 
entries  arc  updated  as  the  partitioning  procedure  is  carried 
out.  All  the  entries  are  initially  set  to  some  number  p^, 
O^p^l.  At  each  stage  of  the  procedure,  the  entry  corresponding 
to  the  pair  of  words  selected  for  testing  is  either  augmented 
or  set  to  zero  according  as  the  test  result  for  this  pair  is 
"believed  equivalent"  or  "not  equivalent"  respectively;  the 
other  entries  are  unchanged.  Thus  the  x,y  entry  in  the  matrix 
after  the  r+lst  stage  is 


covere 


(•)  denotes  a function  which  augments  entri 


believed  equivalent  case 


his  matrix  to  the  true 


2.  The  rate  of  convergence  o 


incidence  matrix  of  the  infinitary  equivalence  relation  will 


be  delayed  b, 


equivalence  and  getting  an  "incorrect"  result 


hich  involve  tho  word  x 


for  examole.  out  of  100  senten 


rammatical  with  the  word 


20  of  these  sentences  would  be  un, 


ubstituted  for 


of  the  number  of 


entonce 


e x from  y t 


p 


number  of  sentences  involving  x (in  the  case  that  x and  y 
are  not  equivalent)  be  considered  as  a candidate  for  the  afore- 
mentioned probability;  note  that  this  quantity,  which  will 
be  denoted  by  epsilon,  uepends  on  each  pair  of  words . If 
epsilon  is  one,  then  all  grammatical  sentences  involving  x 
separate  x from  y when  they  are  not  equivalent;  in  such  a case, 
the  non-equivalent  words  are  separated  as  soon  as  they  are 
presented  together  for  testing.  If  epsilon  is  zero,  then 
every  grammatical  sentence  involving  x is  also  grammatical 
with  y substituted  for  x and  hence  x and  y are  equivalent. 


3.  The  rate  of  convergence  will  also  depend  upon  the  frequency 
with  which  pairs  of  words  are  brought  forth  for  comparison  with 
respect  to  the  equivalence  relation.  For  fixed  words  x and  y 
which  are  not  equivalent,  it  is  of  interest  to  compute  the  rate 
at  which  the  corresponding  matrix  entry  p converges  to  zero. 
Since  the  underlying  process  is  probabilistic,  we  propose  to 
compute  the  mean  rate  at  which  such  an  entry  converges  to  zero. 
This  will  indicate  how  to  estimate  the  mean  time  to  determine 


the  true  incidence  matrix  and  hence  the  mean  time  to  determine 
all  non-equivalent  words. 

Let  c. . denote  the  probability  that  words  x and  y are  brought 
xy 

forth  for  comparison.  Then  the  expected  value  of  the  sum  of 

the  possible  entries  p (r)  for  non-equivalent  words  is 

xy 

estimated  by 


k 


E[  2 P (r)]  = 2 E[p^,(r)  ] ^ (no. of  non-eq.  was.)* 

*3 up  y E[p  (r) ] V . 

x,y  L J J 


4.  The  augmentation  function  f(*  ) is  a concave  function  which 
increases  an  entry  in  the  matrix  from  the  initial  (tabula  rasa) 
value  p^  to  a value  according  as  the  strength  of  belief  of 
equivalence  of  the  corresponding  pair  of  words  increases. 

The  r-th  iterate  of  this  function  which  enters  the  estimate 
of  the  rate  of  convergence  behaves  asymptotically  like  the 
function  1 - abr,  as  will  be  shown  below;  such  a choice  is 
natural  for  the  function  which  is  to  indicate  increasing  strength 
of  belief  in  the  word  class  equivalence  of  words. 

Let  fr*(x)  denote  the  r-th  iterate  of  the  function  f(x)  which 
maps  the  interval  [0,1]  into  itself;  moreover,  assume  that 
f(x)  is  continuous,  that  f'(x)  exists  in  (0,l)  and  that  f(l)=l. 
Then  by  the  successive  application  of  the  mean  value  theorem, 

frhl#(x)  = f[ fr*(x)]  =f[fr*(x)]  - f(l)  + f(l) 


= i - S/(D  - f[fr>'u)]} 

- 1 - kr[l  - fr*(x)] 

= 1 - kjl  - [1  - k (1  - fr"1*(x))]\ 

n n-l  J 


(Vr-l  " 


■itjHi  - r(x)] 


.n  l 

ab 


where  the 


k.  are  constants  (values  of  the  derlvate  of  f(*) 
x 

at  the  appropriate  Intermediate  value) . 

5.  Suppose  that  we  center  attention  on  a fixed  pair  of  non- 
equivalent  words  x and  y.  After  t sentences  in  the  partitioning 
procedure  have  been  processed,  suppose  that  k of  them  produced 
the  pair  for  comparison:  then  the  x,y  entry  in  the  matrix 

f fk*(Pl ) 

= p(t)-{  °r  ‘ , 

-v  1 0 

where  fk"(*)  denotes  the  k-th  iterate  of  the  augmentation  function 
This  entry  is  non-zero  in  the  case  tint  k trials  took  place 
with  x and  y believed  equivalent;  each  such  trial  has  probability 
d = 1 - £„  , where  epsilon  was  described  in  section  2.  Hence, 

xy 

E[p(t)  : of  which  k trials  involve  x,y]  = fk ' (p1)dk, 
v;hcre  t . 


6.  Recall  that  c denotes  the  probability  that  words  x and  y 
are  brought  forth  for  comparison;  then  for  the  t trials 
performed,  we  have 

Efp(t)]  - (I  - of  + tc(l  - cf'hUpp  i-  ...-I-  otdtft*(p1) 

» iftb  ="(1  - cjt-Wtpp 

fe-0 

(cr])k  (1  - c)s'k  [1  - a bkJ 

k.--o 


[(cd)  + ( 1 -c ) i u - a[(cdb)  + (1-c)] 
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thus , 


[ ( cd ) + (l  - c)]°,  since  (Vb^l; 


:[p(t)  ] — [1  - c(i-d)  ] 


[i  - cer  . 


7.  Per  a given  pair  of  words  which  is  equivalent,  after  t 
sentences  the  x,y  entry  will  be 


P.„.(t)  - p(t)  = f'MPn  ) 

■*N>  — 

for  1c  sentences  (out  of  the  t)  which  involve  this  pair, 


Then  the  mean  value  of  the  entry  is 


E[p(t)]  - X(‘)  ek(l  - c)t-kfk*(p1) 

fe--o  ‘v 

~ 1 - a[(cb)  -i-  (1  - c)  ]t 


8.  To  summarise,  after  t sentences  have  been  generated  lc  of  which 
involve  the  pair  x,y,  the  corresponding  entry  in  this  "incidence 
matrix"  has  mean  value  for  largo  number  of  trials  t given  by 


:[p:rv]' 


Cl  - a[l  - c ( 1 - b)  if  x=y 


:i  - oe  r 
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Introduction 

This  paper  deals  with  a model  of  language  syntax  acquisition. 
It  is  assumed  that  the  artificial  language  has  a finite  description 
which  we  hope  to  discover  on  the  basis  of  a finite  sample  of 
sentences.  Specifically  excluded  is  the  simple  formulation  of  the 
observations  actually  made,  although  a naive  description  is  highly 
desirable.  In  the  terminology  of  learning  models,  an  Insightful 
model  is  to  be  preferred  over  the  rote  learning  exemplified  in  a 
list . 


Proposal  for  the  study  of  this  problem  was  posed  in  the 
paper  "Pattern  Conception"  by  Milier  and  Chomsky  [19157];  this 
paper  elaborated  on  the  virtues  of  finite  state  automaton  models. 
An  early  description  of  a machine  to  carry  out  grammar  discovery 
is  given  by  Solomonoff  in  [1957]-  Variations  of  this  problem 
are  found  in  artificial  intelligence,  human  congnitive  studies, 
pattern  recognition,  linguistics,  and  in  systems  theory  under 
labels  such  as  inductive  Inference,  automaton  identification  and 
grammatical  inference.  The  fine  exposition  by  Fu  [197*0  in  a 
chapter  entitled  "Grammatical  Inference  for  Syntactic  Pattern 
Recognition"  surveys  many  approaches  and  also  contains  *19 
references.  Additional  references  can  be  found  in  Trakhtenbrot 
and  Barzdin  [19731*  More  recent  references  are  Adleman  and 
Blum  [1975]  which  deals  with  degrees  of  unsolvability  of  inductive 
inference  problems  and  Angluin  [1976]  which  explores  complexity 
of  the  inference  of  finite  state  grammars  from  a finite  set  of 
negative  sample  strings. 


positive  arid 
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The  following  investigation  is  motivated  toward  the  pre- 
sentation of  natural  models.  The  investigation  follows  the  paths 
initiated  by  Crenander  [ 1 9 7 2l  1 in  his  abductory  induction  models. 

If  we  restrict  consideration  to  regular  grammars,  the  number  of 
possible  grammars  is  overwhelmingly  large  for  even  small  size 
alphabets  and  modest  numbers  of  variables.  This  implies  that 
models  for  analogy  to  real  world  phenomena  which  exhibit  language 
acquisition  ability  cannot  be  based  on  enumerative  inference  or 
finite  search  techniques;  for  this  reason,  we  also  seek  an 
alternative  to  direct  implementations  of  maximum  likelihood 
solutions  which  select  one  grammar  over  another  on  probabilistic 
criteria  by  the  solution  of  large  scale  linear  systems. 

In  particular,  algorithms  are  presented  which  carry  out 
grammar  discovery  by  the  construction  of  the  (right  invariant) 
equivalence  classes  induced  by  the  finite  state  automaton.  The 
classes  are  established  by  means  of  a training  sequence  and  a 
teacher.  The  number  of  possible  partitions  induced  by  an 
equivalence  relation  is  known  to  be  given  by  sums  of  Stirling 
numbers  of  the  second  kind.  To  reduce  the  combinatorial  complexity 
of  the  problem,  an  equivalence  relation  defined  on  the  word 
dictionary  in  a natural  way  by  grammatical  substitutability  is 
used  to  partition  this  dictionary  into  grammatical  equivalence 
classes.  The  prototype  of  each  word  class,  that  is,  the  repre- 
sentor of  each  of  the  established  classes,  is  then  used  to  carry  i 

out  the  synthesis  of  the  minimal  state  automaton.  The  algorithms 
are  imbedded  in  a statistical  environment;  they  are  studied  experi- 

i <^i 

mentally  and  theoretically.  Sp*  ■ ial  attention  is  focussed  on  the 


r 


flexibility  of  the  algorithms  to  accommodate  non-systcmatic 
errors  (e.g.  a teacher  which  occasionally  makes  errors). 

The  model  consists  of  two  components:  the  first  component 

is  a teacher  in  a dual  role.  The  teacher  acts  as  a generative 
grammar  which  produces  strings  in  a syntax-controlled  probability 
language  (Grenander  [1967])-  The  random  structure  of  the  language 
is  induced  by  imposing  probabilities  on  the  rules  of  the  grammar. 
In  this  way,  the  production  of  some  strings  can  be  inhibited  while 
the  production  of  others  can  be  made  more  likely.  This  adds  to 
the  teacher's  grammar  a facet  of  linguistic  performance  although 
it  does  not  increase  the  generative  power  of  the  grammar.  The 
teacher  also  serves  as  an  acceptor  of  strings;  it  can  judge 
whether  or  not  a sentence  presented  to  it  by  the  learner  is  within 


its  competence. 

The  second  component,  called  the  learner,  consists  of  two 
principal  procedures  which  carry  out  the  construction  of  a copy 
of  the  teacher.  The  first  procedure  or  phase  is  devoted  to  the 
classification  of  words  Into  equivalence  classes  on  the  basis  ci 
grammatical  substitutability;  this  procedure  is  in  principle 
infinitary.  These  classes  are  familiar  in  structural  linguistics 
where  they  are  called  families  (Kulagina  [1953])  or  categories 
(Miller  and  Chomsky  [1963])  although  we  do  not  use  these  classes 
in  quite  the  same  way.  They  are  introduced  here  to  reduce  the 
combinatorial  complexity  of  the  constructions. 

The  first  phase  begins  with  the  assumption  that  all  words 
are  in  orv  equivalence  class.:  this  is  the  tabula  ra sa  with  which 


the  learner  begins.  The  discovery  of  the  classes  is  carried  out 
by  the  resolution  of  the  dictionary  into  the  classes  which  form 
the  required  partition.  The  teacher  randomly  generates  strings 
which  are  presented  to  the  learner  as  a training  sequence.  For 
each  string  in  the  training  sequence,  a word  is  selected  according 
to  a weighted  probabilistic  strategy;  this  might  be  thought  of  as 
an  attention  function.  This  string  is  used  to  either  strengthen 
the  learner's  belief  about  the  selected  word's  membership  in  its 
present  grammatical  class  or  it  is  used  to  introduce  a new 
hypothesis  to  be  entertained  about  the  relation  of  this  word  to 
the  sought  for  partition  of  the  dictionary.  The  term  abduction, 
introduced  by  C.S.5.  Peirce  [1931]  to  describe  the  starting  of  a 
hypothesis,  is  applied  to  describe  this  process  which  either  changes 
the  class  membership  of  the  selected  word  or  forms  a new  class  with 
this  selected  word.  All  classes  formed  arc  characterized  by  a 
fixed  representative  word  called  a prototype . 

The  second  phase  carries  out  the  discovery  of  the  syntactic 
variables  and  the  rewrite  rules  which  govern  these  variables. 
Initially  in  this  phase,  the  learner  has  a tabula  rasa  with 
respect  to  variables;  the  discovery  of  variables  proceeds  in  a 
manner  analogous  to  the  phase  one  process.  Each  string  in  the 
training  sequence  is  analyzed  to  some  preset  depth  to  determine 
initial  string  equivalence  classes,  i.e.  the  syntactic  variables. 
Each  string  can  be  decoded  with  respect  to  the  word  class  partition 
determined  in  phase  one  as  described  above.  This  indexing  scheme 
is  used  to  implement;  efficiently  the  process  which  determines  the 


partition  of  initial  strings  into  the  sought  for  syntactic 


variables.  A subsequent  encoding  of  initial  strings  is  a 
I’epresentation  of  the  rewrite  rules.  A simple  tally  scheme 
computes  the  experimental  frequency-defined  probabilities. 

The  model  described  above  is  a blueprint  for  a language 
discovery  machine.  Such  a machine  has  been  implemented  in  A 
programming  Language  (APL) : this  machine  has  been  tested  on 

a fragment  English  grammar  and  on  several  formal  grammars. 

The  fragment  English  grammar  consists  of  87  rules  on  52  'words 
in  23  classes;  the  rules  govern  the  18  syntactic  variables  in 
this  syntax-controlled  probability  grammar.  The  teacher-learner 
interaction  is  protrayed  with  no  explicit  semantics  and  no 


environment'-.  That  is,  (context)  semantics  and  pragmatism  are 
contained  in  neither  the  teacher  or  learner  nor  the  training 
sequence.  The  language  strings  appear  to  have  a semantic  aspect: 
this  is  built  into  the  syntactic  rewrite  rules.  The  expected 
sentence  length  (computed  from  the  mathematical  model)  is  7.05 
words.  In  a typical  experiment,  115  sentences  were  heard  by  the 
learner  to  determine  20  of  the  23  classes  and  to  correctly 
classify  Jl8  of  the  words  in  the  dictionary.  After  27  sentences 
were  analyzed  to  a depth  of  8 words,  17  of  the  18  variables  were 
discovered;  when  the  depth  was  increased  to  15,  20  additional 
sentences  were  generated  before  the  remaining  variable  was 
discovered.  The  graphs  in  figures  (1)  and  (2)  illustrate  the 
learning  characteristics  exemplified  by  the  word  class  discovery 
procedure.  More  complete  experimental  results  and  a mathematical 
learning  model  appear  in  later  sections. 


Preliminaries 


The  following  sections,  which  present  no  new  results, 
contain  the  definitions  and  results  from  formal  language  theory 
and  automaton  theory  which  servo  as  research  background  for 
syntactic  abduction  of  linear  strings  presented  in  the  later 
chapters.  The  exposition  follows  Hopcroft  and  Ullinar.  [1969]. 
The  exposition  of  the  syntax-controlled  probabilities  follows 
Grenander  [196?]. 


Phrase  Struccur^  Languages 

1.  Let  Vr„  denote  a finite  set  of  symbols  called  terminals  or 
words;  let  Vrn  denote  the  set  of  all  finite  length  strings  of 

u. 

s 4. 

these  'words;  and  let  VT  denote  Vr„  UfMULL}  where  the  empty 
string  of  length  zero  is  denoted  by  NULL.  A language  is  any 
element  of  the  powerset  of  Y^,.  Those  (possibly  infinite) 
subsets  of  V^,  which  have  finite  generating  representations  are 
called  recursively  enumerable.  These  finite  generating  repre- 
sentations or  specifications  are  called  phrase  structure 
grammars  and  are  formulated  as  follows:  introduce  an  auxiliary 

finite  set  of  symbols,  denoted  by  VM,  called  non-terminals  or 
syntactic  variables,  with  a distinguished  symbol  S (G V . , ) ; 
introduce  a finite  set  of  rewrite  rules  R which  govern  these 
variables  and  which  is  a subset;  of  V*  xV'?:,  where  V denotes 
V uvrp.  Then  the  phrase  structure  language  consists  of  the 
strings  which  can  be  derived  from  S (the  start  symbol)  by 
successive  application  of  the  rules.  This  is  rigorously 
described  by  the  introduction  of  a relation  from  V+  to  Vs  as 

-f- 

follows:  for  any  uGV  and  vGV*,  u is  said  to  directly  derive 

v (in  the  grammar)  if  there  are  strings  i,j,x,y(.-Vs  such  that; 
u = xiy,  v = xjy  and  (i,j)€R.  This  can  be  extended  by  saying 
u derives  v in  the  grammar  if  either  u=v  or  if  there  is  a 
finite  sequence  Z , Z,  , Z-, , . . . , Z C Vs,  m > i,  such  that  u=Z  , 
v=Z,n,  and  Z.<l  directly  derives  Z.. +1  for  i = 0(l)m-l.  Then  the 
language  generated  by  this  grammar  is  defined  as  the  set  of 
strings  of  words  which  can  be  derived  from  the  start  symbol  S. 


If  S derives  u (denoted’ by  S •+ 
then  u is  called  a sentential 
the  set  of  rewrite  rules,  for 
customarily  also  written  as  i 
the  same  language,  then  they  a 


u)  and  u contains  variables, 
form.  Mote  that  the  elements  in 
example  denoted  by  (i,j),  are 
-+  y.  If  two  grammars  generate 
re  said  to  be  weakly  equivalent. 


2.  Context-Free  Languages.  If  the  rewrite  rules  R are 


restricted  to  finite  subsets  of  V,  *VS,  then  the  resultant 


grammar  is  called  a context-free  grammar. 


3.  Finite-State  Languages . Those  subsets  of  context-free 
grammars  to  which  we  focus  our  attention  are  called  finite  state 
grammars.  The  variables  are  governed  by  rewrite  rules  of  two 
types:  continuing  u ->-  xj  or  terminating  i -►  x,  where  i,j  6Vy 

and  xGVm.  Denote  the  finite  set  of  rules  which  rewrite  i by 


R and  assume  that  the  generating  algorithm  begins  by  the 


application  of  a rule  selected  from  R,  . 


?he  language  generated  by  such  a grammar,  which  consists 


of  Vrp,VM  and  R,  is  denoted  by  L(G);  the  symbol  G denotes  the 

i.  W 


triple  (Vn,V,,,R)  and  R denotes  the  finite  set  of  all  rules, 
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The  language  L(G)  consists  of  ail  those  strings  in  Vr.,  which  can 


be  produced  by  the  application  of  the  rules  in  H;  note  that  any 
string  in  L(G)  is  as  likely  to  appear  as  any  other  as  a result 
of  the  application  of  the  grammar  G. 


b . Relation  to  Finite  Gtato  Automata, 
by  a grammar  is  some  set  of  strings  as 


The  language  generated 
described  above.  This 


set  is  also  the  set  accepted  by  some  finite  state  automaton. 


The  identification  of  L(G)  with  its  finite  state  automaton 
acceptor  proceeds  as  follows: 

the  variables  in  V,.  constitute  the  states  l,2,...,n  ; in 

!•»  v 

addition,  introduce  a final  state  F which  is  the  "target"  state 
for  those  variables  which  are  governed  by  terminating  rewrite 


5.  Syntax-Controlled  Frobabll lilcs . For  each 

R.  = {r.  ,r.  ,...,r.  },  where  r..  denotes  a rule  and  n.  is 

1 1 1 1 ’ in  ’ i 

1 11  ■l2  i 

the  number  of  rules  rewriting  i introduce  a probability  distr.il 
tion  over  so  that 

Z 1 P(r,.)  = 1, 


where  1=1, 2, ... , n 


6.  Markov  Chain.  Consider  the  application  of  the  grammar  G 
together  with  the  probability  distribution.  In  terms  of  tine 
automaton  description,  the  probability  that  the  machine  will 
be  at  state  £ at  time  t+1  given  that  it  is  at  state  k ac  time 


t is  specified. 


system  which  evolves  through  a finite  number 


of  states  (n  +1)  with  a specified  conditional  probability  of 

V 

transition  between  two  states  for  a given  state  at  time  t whicn 
is  independent  of  t is  called  a finite  homogeneous  Markov  chain 
(Kemeny  and  Snell  [i960]).  The  familiar  state  diagram  for 
finite  automata  (with  labeled  arcs  which  indicate  letters  in  VT 
and  the  probability)  has  a description  in  terms  of  two  matrices 
a matrix  of  probabilities  and  a matrix  which  prescribes  the 
letters  which  correspond  to  transitions  between  states. 
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